Abstract

We present estimators for a well studied statistical estimation problem: the estimation
for the linear regression model with soft sparsity constraints ($\ell_q$ constraint with
$0 < q \le 1$) in the high-dimensional setting. We first present a family of estimators,
called the projected nearest neighbor estimator, and show, by using results from Convex
Geometry, that such an estimator is within a logarithmic factor of the optimal for any
design matrix. Then, by utilizing a semi-definite programming technique developed in [41],
we obtain an approximation algorithm for computing the minimax risk for any such
estimation task and also a polynomial time nearly optimal estimator for the important
case of the $\ell_1$ sparsity constraint. Such results were only known before for special
cases, despite decades of studies on this problem. We also extend the method to the
adaptive case when the parameter radius is unknown.

1 Introduction

In the classical estimation problem with linear regression model, one observes a noisy
observation $\tilde y$ of some $y \in \mathbb{R}^n$ where $y = X\theta$ for a given $n \times p$
matrix $X$ (called the design matrix) and an unknown $\theta \in \mathbb{R}^p$, and wishes to
estimate $y$ or $\theta$. Recently, there has been enormous interest in the high-dimensional
setting, which in addition assumes that the design matrix is high-dimensional, i.e. $p \gg n$,
and that $\theta$ satisfies certain sparsity constraints. Such sparsity constraints can be
"hard", when they bound the number of non-zero components in $\theta$, or "soft", when
$\theta$ is assumed to belong to the unit $\ell_q$ ball for $0 < q \le 1$. In existing studies,
the focus has so far been on the conditions needed for $X$ such that certain (typically
polynomial time) estimators are nearly optimal or achieve the lowest possible error for the
given parameters. The work along this line has been quite successful [19, 1, 2, 7, 16, 13,
17, 14, 5, 11, 4, 43, 44, 34, 18] and has produced many characterizations of $X$ (typically a
Gaussian random matrix) for which a polynomial time nearly optimal estimator exists.

The main departure point of this study is that we consider the problem of designing
nearly optimal estimators for any given design matrix $X$, i.e. we make no assumption about
$X$. As the main contribution of this paper, we present a family of estimators, which we
call the projected nearest neighbor estimator (PNN), and show that for any design matrix
$X$, there is a projected nearest neighbor estimator that is nearly optimal in terms of the
prediction risk for the corresponding linear regression problem over soft sparsity constraints.
As a consequence, we obtain a polynomial time algorithm to compute the approximate
minimax risk for any such problem and a polynomial time estimator in the important case
of $q = 1$. Our results represent the first provably nearly optimal estimators without any
constraint on the design matrix for $0 < q \le 1$. We also design an adaptive estimator for
the case when the $\ell_1$ radius is not given.

We believe that studying optimal estimators for arbitrary $X$ is important for multiple
reasons. First, in practice we often do not have control over the design matrix or even the
distribution of the design matrix. The design matrix might be ill-conditioned such that
no estimator can achieve good accuracy. On the other hand, the design matrix may have
structure, as is often the case in practice, rather than being completely random. In this
case, it is important to take advantage of such structure to obtain better accuracy. Secondly,
while many characterizations (typically some isometry property on $X$) are known under which
certain algorithms work well, it is often difficult to tell if the required property holds for
a given $X$. So most results assume that $X$ is a Gaussian random matrix. Thirdly,
relaxing the requirement on the design matrix $X$ calls for the development of new algorithms
as well as new analysis tools. Indeed, to argue the optimality of our estimator, we have
to utilize novel tools from Convex Geometry (the classical restricted invertibility result by
Bourgain and Tzafriri [8]).

1.1 Problem setup 

In the linear regression problem, one observes $\tilde y = y + g \in \mathbb{R}^n$, where
$y = X\theta$ for a given $n \times p$ matrix $X$ and an unknown vector $\theta \in \ell_q(C)$
for $0 < q \le 1$, where

$$\ell_q(C) = \Big\{(\theta_1, \ldots, \theta_p) : \Big(\sum_i |\theta_i|^q\Big)^{1/q} \le C\Big\}.$$

In addition, the noise $g$ is a random vector drawn from the multivariate Gaussian
distribution with covariance matrix $\sigma^2 I$. In this paper, we only consider prediction,
i.e. the estimation of $y$ but not of $\theta$. We use the standard total squared loss$^1$ to
measure the error of an estimation, i.e.

$$\mathrm{loss}(\hat y, y) = \|\hat y - y\|^2 = \sum_i (\hat y_i - y_i)^2.$$

For an estimator $M : \mathbb{R}^n \to \mathbb{R}^n$, we define the expected error of $M$ on an
input $y$ under Gaussian noise as

$$\mathrm{err}_M(y, \sigma) = \mathbb{E}_{\tilde y = y + g,\; g \sim \mathcal{G}(\sigma)}\, \mathrm{loss}(M(\tilde y), y) = \mathbb{E}_{\tilde y = y + g,\; g \sim \mathcal{G}(\sigma)}\, \|M(\tilde y) - y\|^2.$$

Following [20], for $K \subseteq \mathbb{R}^n$, the risk of $M$ over $K$ is defined as

$$R_M(K, \sigma) = \sup_{y \in K} \mathrm{err}_M(y, \sigma). \qquad (1)$$

Define the minimax risk, denoted by $R^*(K, \sigma)$, as the minimum achievable risk among
all the possible estimators, i.e.

$$R^*(K, \sigma) = \inf_M R_M(K, \sigma). \qquad (2)$$

For the aforementioned linear model with sparsity constraint $\ell_q(C)$, we have
$K = X\ell_q(C)$ for an $n \times p$ design matrix $X$. Clearly, the minimax risk $R^*$ ranges
between $0$ and $n\sigma^2$ and depends on the structure of $X$. The main goal of this paper is
to design an estimator $M$ such that $R_M(X\ell_q(C), \sigma)$ is close to
$R^*(X\ell_q(C), \sigma)$ for any given $X$. For our main results, we consider the case where
the sparsity radius $C$ is given. Since we will only consider the prediction risk, we can
assume, by rescaling $X$, that $C = 1$. In what follows, we write $\ell_q$ for $\ell_q(1)$. In
addition, we only consider the high-dimensional case where $p \ge n$, because for $p < n$ we
can apply a rotation to the design matrix so that the last $n - p$ rows are entirely $0$.
Since Gaussian noise is invariant under rotation, this does not affect the minimax risk,
and the dimensions of the design matrix are effectively reduced to $p \times p$.

1 We use the total squared error instead of the common mean squared error purely for the
brevity of notation.
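The observation model above can be simulated with a few lines of NumPy. This is only a minimal sketch: the helper names `sample_lq_ball_point`, `observe`, and `loss` are hypothetical, and the way $\theta$ is drawn inside the $\ell_q$ ball is an arbitrary illustrative choice, not something prescribed by the paper.

```python
import numpy as np

def sample_lq_ball_point(p, q, rng):
    """Return some theta with (sum_i |theta_i|^q)^(1/q) <= 1 (illustrative choice)."""
    theta = rng.standard_normal(p)
    lq = np.sum(np.abs(theta) ** q) ** (1.0 / q)
    # Rescaling by 1/lq puts theta on the unit l_q sphere; the uniform factor moves it inside.
    return theta / lq * rng.uniform(0.0, 1.0)

def observe(X, theta, sigma, rng):
    """Return (y, y_tilde) with y = X theta and y_tilde = y + g, g ~ N(0, sigma^2 I)."""
    y = X @ theta
    return y, y + sigma * rng.standard_normal(X.shape[0])

def loss(y_hat, y):
    """Total squared loss ||y_hat - y||^2."""
    return float(np.sum((y_hat - y) ** 2))

rng = np.random.default_rng(0)
n, p, q, sigma = 20, 100, 0.5, 0.1   # assumed toy sizes with p >= n
X = rng.standard_normal((n, p))
theta = sample_lq_ball_point(p, q, rng)
y, y_tilde = observe(X, theta, sigma, rng)
trivial_loss = loss(y_tilde, y)      # loss of the trivial estimator M(y_tilde) = y_tilde
```

The trivial estimator used in the last line has expected loss $n\sigma^2$, which is the upper end of the range of the minimax risk mentioned above.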

1.2 Main contribution 

We present a family of estimators, called the projected nearest neighbor estimator (PNN),
that can achieve nearly optimal risk for any design matrix $X$ and any given $0 < q \le 1$.
The projected nearest neighbor estimator is a combination of two classic estimators:
the orthogonal projection estimator, in which the estimation is obtained by projecting the
observation $\tilde y$ to a properly chosen subspace, and the nearest neighbor estimator, in
which $\tilde y$ is mapped to the closest point (in terms of $\ell_2$ distance) on the ground
truth set $K$. The projected nearest neighbor estimator is defined with respect to an
orthogonal projection $P$. It is the sum of two components: one, similar to the orthogonal
projection estimator, is the projection $P\tilde y$ of $\tilde y$ by $P$; the other, similar to
the nearest neighbor estimator, is the nearest neighbor projection of $P^\perp \tilde y$ on
$P^\perp K$, where $P^\perp$ is the projection orthogonal to $P$. As the main contribution of
this work, we show that for any $X$, $0 < q \le 1$, and $\sigma > 0$, there always exists a
projection $P$ so that the corresponding projected nearest neighbor estimator for
$K = X\ell_q$ is nearly minimax optimal. More precisely, we show the following.$^2$

Theorem 1. For any given $n \times p$ matrix $X$, $0 < q \le 1$, and $\sigma > 0$, there exists
a projected nearest neighbor estimator $M$ such that

$$R_M(X\ell_q, \sigma) = O\big(c_q (\log p)^{1 - q/2} R^*(X\ell_q, \sigma)\big),$$

1 , „ 

where c q = 0(2i - In -) is a constant dependent on q only. 

In the above theorem, the projection $P$ is chosen in two steps: 1. for each $0 \le k \le n$,
a $k$-dimensional projection $P_k$ is chosen to minimize $\max_j \|P_k^\perp X_j\|$, where the
$X_j$'s are the column vectors of $X$; 2. a proper $k^*$ is chosen to minimize the risk among
all the $P_k$'s. Finding the projection in Step 1 turns out to be NP-hard. However, by using
the semi-definite programming technique in [41], we can compute an approximately optimal
projection and therefore an approximate minimax risk in polynomial time.

Theorem 2. For any given $n \times p$ matrix $X$, $0 < q \le 1$, and $\sigma > 0$, we can compute
an $O(c_q \log p)$ approximation$^3$ of $R^*(X\ell_q, \sigma)$ in polynomial time. When $q = 1$,
there is a randomized polynomial time estimator that is within an $O(\log p)$ factor of the
optimal.

The above two results assume that the radius of the $\ell_q$ ball is given. For $q = 1$, we can
extend the estimator to the adaptive case when $\|\theta\|_1$ is unknown. Using an idea similar
to the projected nearest neighbor estimator, we have the following.

2 Throughout this paper, the $O$ notation only hides some absolute constant, i.e. a constant
independent of any of the parameters, such as $n, p, q, \sigma, X, \theta, y$.

3 A quantity $a$ is a $c$-approximation of $a^* > 0$ if $a^* \le a \le c a^*$.

Theorem 3. There is a polynomial time adaptive estimator $A$ such that for any given
$n \times p$ matrix $X$, $\theta$, and $\sigma > 0$,

$$\mathrm{err}_A(X\theta, \sigma) = O\big(\log p \cdot R^*(X\ell_1(\|\theta\|_1), \sigma) + \sqrt{n \log n}\,\sigma^2\big). \qquad (3)$$

Notice that the first term of the above error is within an $O(\log p)$ factor of the oracle
risk bound when $\|\theta\|_1$ is given. While we do not quite get the true oracle bound due to
the presence of the additive term $\sqrt{n \log n}\,\sigma^2$, the bound becomes a true (and
non-trivial) oracle bound for a rather large range of $\|\theta\|_1$. See Remark 7 for a more
detailed discussion.

1.3 Intuition 

We provide some high level intuition for the projected nearest neighbor estimator. The
orthogonal projection estimator, by projecting the observation to a chosen subspace,
effectively identifies the "leading factors" in the ground truth set. It works well when $K$
is "skewed". However, by simple projection, it ignores the detailed local geometry of $K$.
This makes it less effective when $K$ has many constraints or has constraints involving many
dimensions, e.g. when $K$ satisfies sparsity constraints. On the other hand, the nearest
neighbor estimator, by projecting to the nearest neighbor, depends more on the local
geometry of $K$. But it ignores the global geometry of $K$, so it works well only when the
body is not skewed along any direction. In some sense, the projected nearest neighbor
estimator achieves optimality by taking both global and local geometry into account: it
first identifies the skewed dimensions and then applies the nearest neighbor estimator to
the "residual" space, which is less biased.

It has long been known that the nearest neighbor estimator may be far from optimal
when there is strong correlation among the column vectors of the design matrix $X$ [21, 22, 45].
Many methods have been proposed to deal with this problem. The projection phase
can be viewed as one way to remove the correlation such that the residual vectors are less
biased. This might not be obvious, as the projection only minimizes the maximum $\ell_2$ norm
of the projected vectors, a seemingly different quantity. However, in order for the projected
vectors to be all short, they necessarily "span" all the directions, because otherwise we could
"tilt" the projection to reduce the longest projection. This intuition can actually be made
rigorous with the help of tools from Convex Geometry [8].

The technical analysis of the projected nearest neighbor estimator is inspired by two
recent works: one is the analysis of the nearest neighbor estimator by Raskutti, Wainwright,
and Yu [34]; the other is on the optimality of the orthogonal projection estimator by
Javanmard and the author [26]. In [34], it is shown that if $X$ satisfies a certain isometry
property, then the nearest neighbor estimator is close to optimal. On the other hand, [26]
shows that for symmetric convex bodies there always exists a projection such that the
orthogonal projection estimator is close to optimal. At a very high level, we combine the
analyses of these two results and show that there always exists a nearly optimal projection
of $X$ such that the bound in [34] is nearly optimal on the projected body.

While the main machinery in our analysis is similar to what is in [34] and [26], we
need further insights for our problem. For the nearest neighbor analysis, we need a slightly
different analysis than [34] to obtain an upper bound that suits our purpose. This also allows
our result to hold for all ranges of $p, n$. The lower bound is obtained by extending the
techniques in [26] to sets of the form $X\ell_q$ for $0 < q \le 1$. The technique utilizes some
classical results from Banach space geometry, first started by Bourgain and Tzafriri [8] and
fully developed by Szarek, Talagrand, and Giannopoulos [36, 23].

Despite its somewhat involved analysis, the projected nearest neighbor estimator suggests
a quite natural heuristic: project $K = X\ell_q$ to a subspace to make it more "round"
before applying other estimators (in our case the nearest neighbor estimator). This
approach is probably already being used in practice. As the main result in this paper, we
prove that such a heuristic can actually lead to a nearly optimal estimator. In addition, a
nearly optimal projection can be found in polynomial time via the semi-definite programming
technique in [41].

For the adaptive estimator, we consider the case of $q = 1$. The well known Lasso [38]
and Dantzig selector [14] can be viewed as adaptive versions of the nearest neighbor
estimator. According to [5], these estimators can achieve an error bound dependent on
$\|\theta\|_1$, which is the same as the oracle risk bound of PNN when the projection is taken
as the identity projection. We can apply Lasso or the Dantzig selector to the projection of
$X$ to obtain the oracle risk bound of PNN under different projection dimensions. This way,
we can obtain a set of estimations among which one achieves the true oracle risk bound.
Unfortunately, we cannot reliably determine which one it is. By using ideas from hypothesis
testing, we can only choose one within $O(\sqrt{n \log n}\,\sigma^2)$ error, which accounts for
the additive term in Theorem 3.

More concretely, in PNN, the optimal projection dimension is a staircase function of the
parameter radius. So we try to "guess" $\|\theta\|_1$ at those critical values at which the
optimal dimension changes value. The problem then reduces to a hypothesis testing problem on
whether $y = X\theta$ belongs to some convex body. By using the statistics of
$\|\tilde y - \hat y\|^2$, we can achieve the claimed bound. Our procedure is similar in spirit
to the classical Lepski's recipe [28, 6] for converting a non-adaptive estimator to an
adaptive one. But there is a significant difference, as the PNN estimator is non-linear and
the projections at different dimensions lack a nested structure. As a result, our bound
leaves an additive gap of $\sqrt{n \log n}\,\sigma^2$.

1.4 Related work 

There is a vast amount of work on minimax risk estimation. We refer to [30, 39, 27] for
comprehensive surveys. Despite many studies on this subject, optimal or nearly optimal
estimators are only known for special types of bodies.

One particularly interesting case is when the parameter space is sparse. It has long been
known that no linear estimator works well under such constraints (see for example [20]).
Instead, one needs non-linear estimators such as the thresholding estimator to achieve nearly
optimal risk. Recently, much attention has been paid to the (hard) sparsity constraint
defined as the number of non-zero components of a vector, dubbed the $\ell_0$ quantity. This
problem, called compressive sensing in the literature, is computationally infeasible in
general, so the study has focused on the conditions under which a nearly optimal polynomial
time estimator exists [1, 2, 7, 16, 13, 14, 11, 4, 43, 44].

The case of $q = 1$ is closely related to Lasso [38], which is the nearest neighbor estimator
for the case of $q = 1$ and later evolved into solving a regularized nearest neighbor problem
with the $\ell_1$ norm penalty. While Lasso has proved to be very effective, it is known that
when the design matrix has strong correlation, the Lasso estimator may not produce a good
estimation [21, 22]. Various methods have been proposed to remove the correlations [21,
22, 45] by using different penalty terms. The projected nearest neighbor estimator can
also be viewed as a way to remove correlation. The difference is that our method can
be shown to be close to the optimal solution for any design matrix $X$. In the projected
nearest neighbor estimator, we choose the projection dimension that balances two error
terms. Similar techniques have appeared before. For example, in [3], the estimation is chosen
among greedy approximations of the span of vectors of varying size, and the optimal choice
is made by balancing two error terms. In [10], the dimension is controlled by a stopping rule
dependent on the noise structure. Despite these similarities, the optimality of the projected
nearest neighbor estimator requires a careful choice of the projection via solving a
semi-definite program. It is unlikely that the greedy algorithm can achieve the same goal. On
the other hand, the computational efficiency of the greedy algorithm makes it (or some
variation) an attractive practical alternative to the more complex projection phase in this
paper.

Many authors also consider (arguably more flexible and realistic) soft sparsity
constraints in the form of $\theta \in \ell_q$ for $0 < q \le 1$, the setting considered in this
paper. In [19], asymptotically tight bounds are obtained for $X = I$, the identity matrix. A
similar notion of roughness was studied in [29], in which the soft-thresholding estimator is
shown to be nearly optimal, again for $X = I$, but extended to more general noise and loss
models. In [17], it is shown that there exist design matrices $X$ which allow fairly accurate
estimation when there is no noise. In [42], the authors presented several upper bounds,
dependent on the design matrix $X$, on the loss of the Lasso and Dantzig selector methods when
applied to soft sparsity constraints. Then in [34], it is shown that the nearest neighbor
estimator is nearly optimal if $X$ satisfies a certain isometry property, which holds for a
Gaussian random matrix $X$. In [18], it is shown that for a Gaussian random matrix, the
(polynomial time) $\ell_1$ penalized least squares is nearly optimal. Despite all these
studies, no nearly optimal estimator is known for a general design matrix $X$. So our
knowledge is limited to the case where $X$ is a diagonal matrix or where $X$ satisfies strong
isometry properties. In [12], the authors showed a lower bound on the minimax risk of the
estimation of $\theta$ for any design matrix and with the hard sparsity constraint, but it
could be far from the upper bound in general.

Among previous work, [34] is particularly relevant to our current work. In [34], the
authors show, among many other results, an upper bound for the nearest neighbor estimator
which depends on $q$ and the radius of $K$. While this could be far from optimal,
it turns out that if we apply a proper projection of $K$, the radius of the projection can be
made such that the resulting bound is nearly optimal. For this we follow an approach similar
to [26], in which it is shown that the orthogonal projection estimator is nearly optimal for
symmetric linear constraints. But we need to adapt the argument in [26], as $X\ell_q$ has
exponentially many faces and can be non-convex.

As mentioned earlier, the transformation from a non-adaptive estimator to an adaptive
one is similar to Lepski's method [28, 6], but there are significant differences, as our
non-adaptive estimator does not quite satisfy the properties required by Lepski's method.

2 Preliminaries 

2.1 Basic notations and definitions 

For a vector $x = (x_1, \ldots, x_p) \in \mathbb{R}^p$ and $q > 0$, denote by
$\|x\|_q = (\sum_i |x_i|^q)^{1/q}$. When $q \ge 1$, $\|x\|_q$ is a norm. When $0 < q < 1$,
$\|x\|_q$ is not a norm but it is quasi-convex, as there is a constant $c$ dependent on $q$
such that for any $x, y$, $\|x + y\|_q \le c(\|x\|_q + \|y\|_q)$. We use $\ell_q^p(r)$
to denote the $p$-dimensional $\ell_q$-ball with radius $r$, i.e.

$$\ell_q^p(r) = \{x \in \mathbb{R}^p : \|x\|_q \le r\}.$$
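As a small numerical illustration of the quasi-norm property, the sketch below evaluates $\|x\|_q$ for $q = 1/2$ on two unit coordinate vectors. The function names are hypothetical, and the constant $c = 2^{1/q - 1}$ used here is the standard value for $0 < q < 1$; the text itself only asserts that some constant depending on $q$ exists.

```python
import numpy as np

def lq_norm(x, q):
    """(sum_i |x_i|^q)^(1/q); a norm for q >= 1 and only a quasi-norm for 0 < q < 1."""
    return float(np.sum(np.abs(x) ** q) ** (1.0 / q))

def in_lq_ball(x, q, r=1.0):
    """Membership test for the l_q ball of radius r."""
    return lq_norm(x, q) <= r

q = 0.5
x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])

lhs = lq_norm(x + y, q)                # (1 + 1)^(1/q) = 4 when q = 1/2
rhs = lq_norm(x, q) + lq_norm(y, q)    # 1 + 1 = 2
c = 2.0 ** (1.0 / q - 1.0)             # assumed standard quasi-triangle constant, 2 here
```

Here the triangle inequality fails (`lhs > rhs`), while the relaxed inequality `lhs <= c * rhs` holds with equality, so the $\ell_q$ ball for $q < 1$ is not convex.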

We often drop $p$ when the dimension is clear from the context. We use $\ell_q$ as a
shorthand for $\ell_q(1)$. For a set $K \subseteq \mathbb{R}^n$ containing the origin, define
the $q$-radius of $K$ as $\|K\|_q = \sup_{x \in K} \|x\|_q$. In all these notations, whenever
$q$ is omitted, it means $q = 2$.

We use $\mathcal{G}_n(\sigma)$ to denote the distribution of an $n$-dimensional Gaussian random
variable with covariance matrix $\sigma^2 I$. Again, we often drop $n$ and $\sigma$ when they
are clear from the context.

As standard, $f = O(g)$ if there exists a constant $c > 0$ such that $f \le c \cdot g$, and
$f = \Omega(g)$ if there exists a constant $c > 0$ such that $f \ge c \cdot g$. Throughout this
paper, high probability is understood as probability at least $1 - 1/n^2$.

2.2 Minimax risk 

An estimator $M$ is a map from $\mathbb{R}^n$ to $\mathbb{R}^n$: it takes a noisy observation
$\tilde y = y + g$ of an unknown vector $y \in \mathbb{R}^n$ and maps it to an estimation
$\hat y = M(\tilde y)$. Here we consider the noise drawn from $\mathcal{G}_n(\sigma)$. As
described earlier, the risk $R_M(K, \sigma)$ of $M$ is defined as the maximum expected error
among $y$ in $K$, i.e.

$$R_M(K, \sigma) = \sup_{y \in K} \mathbb{E}_{\tilde y = y + g,\; g \sim \mathcal{G}(\sigma)} \big[\|M(\tilde y) - y\|^2\big].$$

The minimax risk of $K$ is defined as the minimum achievable risk for $K$, i.e.
$R^*(K, \sigma) = \inf_M R_M(K, \sigma)$. We state a well known lower bound on the minimax risk
of Euclidean balls, which we will use later.

Lemma 4. $R^*(\ell_2^n(r), \sigma) = \Omega(\min(n\sigma^2, r^2))$.
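The expected error that enters the definition of risk can be approximated by simulation. The following is a minimal Monte Carlo sketch, not part of the paper; `monte_carlo_error` and its `trials` and `seed` parameters are hypothetical names introduced for illustration.

```python
import numpy as np

def monte_carlo_error(estimator, y, sigma, trials=2000, seed=0):
    """Average of ||M(y + g) - y||^2 over draws g ~ N(0, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    n = y.shape[0]
    total = 0.0
    for _ in range(trials):
        y_tilde = y + sigma * rng.standard_normal(n)
        total += float(np.sum((estimator(y_tilde) - y) ** 2))
    return total / trials

n, sigma = 50, 0.5
y = np.zeros(n)

# The identity estimator pays the full noise energy, about n * sigma^2.
identity_err = monte_carlo_error(lambda v: v, y, sigma)
# The zero estimator pays ||y||^2, which is at most r^2 on a ball of radius r.
zero_err = monte_carlo_error(lambda v: np.zeros_like(v), y, sigma)
```

These two trivial estimators give upper bounds of $n\sigma^2$ and $r^2$ respectively on a Euclidean ball of radius $r$, which is the same $\min(n\sigma^2, r^2)$ shape that Lemma 4 provides as a lower bound.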

2.3 Orthogonal projection estimator 

The orthogonal projection estimator $T$ is a special type of linear estimator. It is defined
with respect to some linear subspace. The estimation is obtained simply by projecting the
observation $\tilde y \in \mathbb{R}^n$ to the subspace. Let $\mathcal{P}_k$ denote the set of all
$k$-dimensional linear subspaces in $\mathbb{R}^n$. For $P \in \mathcal{P}_k$, we also use $P$ to
denote the orthogonal projection to $P$. The estimator $T_P$ is then defined as
$T_P(\tilde y) = P\tilde y$.

Since a Gaussian random vector is invariant under rotation, we have that

$$R_{T_P}(K, \sigma) = k\sigma^2 + \sup_{y \in K} \|y - Py\|^2 = k\sigma^2 + \sup_{y \in K} \|P^\perp y\|^2,$$

where $P^\perp$ denotes the $n - k$ dimensional subspace orthogonal to $P$. For
$0 \le k \le n$, define the Kolmogorov width (as in [31]) as

$$d_k(K) = \inf_{P \in \mathcal{P}_k} \sup_{y \in K} \|y - Py\|.$$

For the $\ell_2$ norm, this definition is equivalent to the following more convenient form,
which we will use throughout the paper:

$$d_k(K) = \inf_{P \in \mathcal{P}_k} \|P^\perp(K)\| = \inf_{P \in \mathcal{P}_{n-k}} \|P(K)\|.$$

Clearly, $d_k(K)$ is monotonically decreasing with $k$. The Kolmogorov width determines the
minimax risk of the orthogonal projection estimators [20]. Let $R_T$ denote the minimum risk
among all the orthogonal projection estimators.

Lemma 5. $R_T(K, \sigma) = \min_k \big(k\sigma^2 + d_k(K)^2\big)$.
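For $K = X\ell_q$ with $0 < q \le 1$, the supremum of $\|P^\perp y\|$ over $K$ is attained at a column of $X$, so the risk of $T_P$ for a fixed subspace is easy to evaluate. The sketch below does this for subspaces spanned by the leading left singular vectors of $X$. That choice of subspace is only a heuristic assumed for illustration (it need not realize the Kolmogorov width), so the result is an upper bound on $R_T$; the function names are hypothetical.

```python
import numpy as np

def projection_risk(X, basis, sigma):
    """k*sigma^2 + max_j ||P_perp x_j||^2, for the subspace spanned by the
    orthonormal columns of `basis` (k = number of columns)."""
    k = basis.shape[1]
    residual = X - basis @ (basis.T @ X)        # columns are P_perp x_j
    return k * sigma**2 + float(np.max(np.sum(residual**2, axis=0)))

def best_principal_projection_risk(X, sigma):
    """Scan k = 0..n over principal subspaces of X (a heuristic, not the optimum)."""
    n = X.shape[0]
    U, _, _ = np.linalg.svd(X, full_matrices=True)
    risks = [projection_risk(X, U[:, :k], sigma) for k in range(n + 1)]
    k_best = int(np.argmin(risks))
    return k_best, risks[k_best]

# Toy design with one long and one short direction.
X = np.diag([3.0, 0.1])
k_best, risk = best_principal_projection_risk(X, sigma=1.0)
```

In this toy case the candidates are $9$ for $k = 0$, $1 + 0.01$ for $k = 1$, and $2$ for $k = 2$, so the scan selects $k = 1$.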

The orthogonal projection estimator has long been known to be nearly optimal for ellipsoids
[32, 25] and more generally for quadratically convex and orthosymmetric objects [20]. However,
it is also well known that the orthogonal projection estimator (actually any linear estimator)
can be far from optimal for the $\ell_1$ ball and therefore does not work well for linear
regression with sparsity constraints.

Lemma 6 ([20]).

$$R_T(\ell_1^n, 1/\sqrt{n}) = \Omega\big(\sqrt{n/\log n}\; R^*(\ell_1^n, 1/\sqrt{n})\big).$$

2.4 Nearest neighbor estimator

The nearest neighbor estimator is another well known estimator. It maps an observation to
the nearest point on $K$, i.e. $N_K(\tilde y) = \arg\min_{\hat y \in K} \|\hat y - \tilde y\|$.
The nearest neighbor estimator is a non-linear estimator and works well for "skinny" objects
such as the $\ell_1$ ball. However, we can construct an example (Section 6.1) to demonstrate
that it can be far from optimal. Denote by $R_N(K, \sigma)$ the risk of the nearest neighbor
estimator.

Lemma 7. There exist ellipsoids E n C W 1 for n = 1,2,... such that R^{E n ,\) = 
Q{^R*(E n ,l)). 
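For $K = X\ell_1$, the nearest neighbor estimator is a convex least squares problem over the $\ell_1$ ball. The sketch below solves it with the Frank-Wolfe method with exact line search. This solver is a choice made here for illustration, not the paper's method; any convex programming routine could be substituted. The function name and the iteration budget are hypothetical.

```python
import numpy as np

def nearest_neighbor_l1_image(X, y_tilde, iters=2000):
    """Approximate argmin over ||theta||_1 <= 1 of ||X theta - y_tilde||^2.
    Returns (X theta, theta)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        resid = X @ theta - y_tilde
        grad = X.T @ resid
        # Linear minimization over the l_1 ball picks a signed coordinate vertex.
        j = int(np.argmax(np.abs(grad)))
        vertex = np.zeros_like(theta)
        vertex[j] = -np.sign(grad[j])
        direction = X @ (vertex - theta)
        denom = float(direction @ direction)
        if denom <= 1e-15:
            break
        # Exact line search for the quadratic objective, clipped to [0, 1].
        step = min(1.0, max(0.0, -float(resid @ direction) / denom))
        if step == 0.0:
            break
        theta += step * (vertex - theta)
    return X @ theta, theta

X = np.eye(2)   # K is then the l_1 ball itself
outside, _ = nearest_neighbor_l1_image(X, np.array([2.0, 0.0]))
inside, _ = nearest_neighbor_l1_image(X, np.array([0.3, 0.2]))
```

An observation outside the ball, such as $(2, 0)$, is mapped to the boundary point $(1, 0)$, while an observation already inside $K$ is (approximately) left unchanged.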

3 Projected nearest neighbor estimator 

We now describe the projected nearest neighbor estimator, which is defined with respect
to some low dimensional orthogonal projection. Given a $k$-dimensional subspace
$P \in \mathcal{P}_k$, we define the projected nearest neighbor estimator $H_P$ as follows. Let
$P^\perp$ denote the $n - k$ dimensional subspace orthogonal to $P$. Recall that we also use
$Px$, $P^\perp x$ to denote, respectively, the orthogonal projection to the space $P$ and
$P^\perp$. The estimator $H_P$ is defined as

$$H_P(\tilde y) = P\tilde y + N_{P^\perp K}(P^\perp \tilde y).$$

In other words, $H_P$ consists of two components, one of which is the projection to the
subspace $P$ and the other the nearest neighbor of $P^\perp \tilde y$ to $P^\perp K$. We use
$R_H(K, \sigma) = \inf_P R_{H_P}(K, \sigma)$ to denote the minimum risk achievable by the
projected nearest neighbor estimator for given $K, \sigma$.

When the projection is set as the identity projection, the corresponding PNN is the 
same as the nearest neighbor estimator. In addition, for the same projection, the projected 
nearest neighbor estimator outperforms the corresponding orthogonal projection estimator. 
So the projected nearest neighbor estimator subsumes both the nearest neighbor and the 
orthogonal projection estimators. In the following, we give an example to show the projected 
nearest neighbor estimator can outperform both the orthogonal projection and the nearest 
neighbor estimators by a large factor. 

Example 8. Consider the ellipsoid defined as 

^ k n 

E n ,k = {x : ^=J>* + E ^< !}• 
Let $K = E_{n,k} \times \ell_1(\sqrt{n})$ with $k < n$. By the above discussion, we can see that
for the orthogonal projection estimator $R_T(K, 1) = \Theta(n)$, and for the nearest neighbor
estimator $R_N(K, 1) = \Theta(n)$, but $R_H(K, 1) = O(\sqrt{n} \log n)$ by setting $P$ to be the
$k$-dimensional projection spanned by the $k$ long axes of $E_{n,k}$. This demonstrates a large
gap between the projected nearest neighbor estimator and both the orthogonal projection and
the nearest neighbor estimators.

To study the performance of the projected nearest neighbor estimator, we first need
the following error bound for the nearest neighbor estimator from [34].

Proposition 9. For $0 < q \le 1$ and $K = X\ell_q$, the nearest neighbor estimator $N$ has risk

$$R_N(K, \sigma) = O\big(c_q \|K\|^q \sigma^{2-q} (\log p)^{1 - q/2}\big),$$

1 -, „ 

where c q = 0(2i - In -) is a constant dependent on q only. 

The above bound is almost identical to Theorem 4(a) in [34]. We will present a slightly
different proof which applies to a wider range of parameter combinations. For clarity and
completeness, we present the proof in Section 6.2. According to Proposition 9, the error is
bounded in terms of $\|K\|^q$. Hence, if we fix the dimension of the projection in a PNN
estimator, in order to minimize the risk, we should seek the projection $P$ that minimizes
$\|PK\|$, i.e. realizes the Kolmogorov width. By using this projection, we obtain the
following upper bound for the projected nearest neighbor estimator.

Corollary 10. For any $0 < q \le 1$ and any $K = X\ell_q^p$,

$$R_H(K, \sigma) = O\Big(\min_{0 \le k \le n} \big(k\sigma^2 + c_q\, d_k(K)^q \sigma^{2-q} (\log p)^{1 - q/2}\big)\Big), \qquad (4)$$

where $c_q$ is the same as in Proposition 9.

Proof. For any fixed $k$, the error consists of two terms: $O(k\sigma^2)$ for the projection, and
$O(c_q\, d_k(K)^q \sigma^{2-q} (\log p)^{1 - q/2})$ for the nearest neighbor estimation. The
second term comes from Proposition 9 with $\|K\|$ replaced by $d_k(K)$ if we apply the
projection that realizes $d_k(K)$. Clearly, we can choose the $k$ with the minimum bound. □

To show that (4) is nearly optimal, we prove an almost matching lower bound in terms of
the Kolmogorov width. This is the key technical contribution of the paper and relies on the
classic restricted invertibility property developed by Bourgain and Tzafriri [8]. The proof
is in Section 6.3.

Theorem 11. For $K = X\ell_q^p$,

$$R^*(K, \sigma) = \Omega\Big(\max_{0 \le k \le n} \min\big(k\sigma^2,\; k^{1 - 2/q} d_k(K)^2\big)\Big). \qquad (5)$$

Theorem 1 follows readily from Corollary 10 and Theorem 11 by setting $k$ to equalize the
two terms in (5). The details are in Section 6.4.
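To make the comparison between (4) and (5) concrete, the following sketch evaluates both expressions for a given width profile $d_0(K), \ldots, d_n(K)$. It is only illustrative: the constants hidden by $O$ and $\Omega$ are set to $1$, $c_q$ is an input defaulting to $1$, and the width profile in the example is hypothetical rather than computed from any design matrix.

```python
import numpy as np

def upper_bound_4(widths, sigma, q, p, c_q=1.0):
    """min over k of k*sigma^2 + c_q * d_k^q * sigma^(2-q) * (log p)^(1-q/2)."""
    widths = np.asarray(widths, dtype=float)
    ks = np.arange(len(widths), dtype=float)
    second = c_q * widths**q * sigma**(2.0 - q) * np.log(p)**(1.0 - q / 2.0)
    return float(np.min(ks * sigma**2 + second))

def lower_bound_5(widths, sigma, q):
    """max over k >= 1 of min(k*sigma^2, k^(1-2/q) * d_k^2); k = 0 contributes 0."""
    widths = np.asarray(widths, dtype=float)
    ks = np.arange(1, len(widths), dtype=float)
    vals = np.minimum(ks * sigma**2, ks**(1.0 - 2.0 / q) * widths[1:]**2)
    return float(np.max(vals)) if vals.size else 0.0

# Hypothetical width profile for n = 4: d_0..d_3 = 1 and d_4 = 0.
widths = [1.0, 1.0, 1.0, 1.0, 0.0]
upper = upper_bound_4(widths, sigma=0.1, q=1.0, p=4)
lower = lower_bound_5(widths, sigma=0.1, q=1.0)
```

For this profile the upper expression is minimized at $k = 4$ (value $0.04$) and the lower expression is maximized at $k = 3$ (value $0.03$), so the two differ by a small factor, well within the $(\log p)^{1 - q/2}$ gap allowed by Theorem 1.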

Remark 1. In the proof of Theorem 1, we choose $k^*$ such that
$d_{k^*}(K) \approx (k^*)^{1/q} \sigma$. When $q$ goes to 0, $k^*$ goes to 1. Therefore, when $q$
is close to 0, the projected nearest neighbor estimator becomes the ordinary nearest neighbor
algorithm. As stated in Theorem 4(b) in [34], the risk of the nearest neighbor estimator is
$O(s \log(p/s) \sigma^2)$ for $\theta \in \ell_0(s)$. On the other hand, if the rank of $X$ is at
least $s$, then $R^*(X\ell_0(s), \sigma) = \Omega(s\sigma^2)$. Hence the nearest neighbor
estimator (and the projected nearest neighbor estimator) is $O(\log p)$ minimax for
the hard sparsity constraint. This is consistent with the bound in Theorem 1 by letting
$q \to 0$.

Remark 2. In the proof of Theorem 11, we actually showed that there exists a submatrix
$X'$ which consists of $k \le n$ columns of $X$ such that the minimax risk of $X'\ell_q^k$ is
close to that of $X\ell_q^p$. In some sense, this means that there is a hardest sub-problem
which has at most $n$ columns.

Remark 3. Our technique still leaves a gap of $(\log p)^{1 - q/2}$. We do not know if this gap
is inherent to the projected nearest neighbor estimator or due to a deficiency of the
analysis. We note that the upper bound cannot be improved in general, as demonstrated by
the example of the $\ell_1$ ball. There might be a chance to improve the lower bound by a
factor of $\sqrt{\log k}$ by more sophisticated techniques. But this is still insufficient to
close the gap, as $k$ might be much smaller than $p$.

Remark 4. While PNN may sound similar to the technique of low dimension projection, 
there are significant differences. For example, when applying low dimension projection, we 
typically would like to preserve the original metric structure, and often a random projection 
suffices. In our case, however, we would like to make the projection as small as possible, 
and it requires more careful selection of the projection. Indeed, it is easy to show that a 
random projection would fail for our purpose. 

4 Algorithms 

While the analysis of projected nearest neighbor estimators is somewhat involved, the
resulting algorithm is quite straightforward. There are two separate parts in the projected
nearest neighbor estimator. First, for given $K$ and $\sigma$, compute the optimal projection
$P$ and $k$. Second, for any observation $\tilde y$, apply the projection and then compute the
nearest neighbor of $P^\perp \tilde y$ to $P^\perp K$.

We will describe these two steps separately. For the first step, by the proof of Theorem 1,
it suffices to compute $d_k(K)$. This problem is however NP-hard [9]. But since
$K = X\ell_q$, $\|P^\perp K\|$ must be realized at one of the $p$ column vectors of $X$ (see the
proof of Lemma 19). Let $V = \{x_i : i = 1, \ldots, p\}$ be the $p$ column vectors of $X$. Then
computing $d_k(K)$ reduces to computing an $n - k$ dimensional projection $P'$ such that
$\max\{\|P'v\| : v \in V\}$ is as small as possible. This problem has been studied in [41],
where it is shown that one can compute an $O(\sqrt{\log p})$ approximation by a semi-definite
programming relaxation. The following proposition is the main result of [41].

Proposition 12. For any $n \times p$ matrix $X$, $0 < q \le 1$, and $0 \le k \le n$, we can
compute in polynomial time an $O(\sqrt{\log p})$ approximation to $d_k(X\ell_q)$. In addition,
we can compute an $n - k$ dimensional subspace $P'$ in randomized polynomial time such that
with high probability, $\|P'(X\ell_q)\| = O(\sqrt{\log p}\; d_k(X\ell_q))$.

As for the second step, we need to compute the nearest neighbor on $K = X\ell_q$ for
any given point. This can be done by convex programming for $q = 1$. Unfortunately, we
do not know how to compute it efficiently for $q < 1$. So we can only claim a polynomial
time nearly optimal estimator for $K = X\ell_1$, as described in Algorithm 1. For simplicity
of description, we present the algorithm as trying all $k = 1, 2, \ldots, n$. Since
$d_k(K)$ is monotonically decreasing, the complexity can be reduced by using a binary search.
Theorem 2 follows from the above discussion.
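One possible reading of the binary search remark, for $q = 1$, is sketched below. Because $k\sigma^2$ is non-decreasing in $k$ and the width term is non-increasing, one can binary search for the first $k$ at which the former overtakes the latter; the better of that $k$ and its predecessor is then within a factor $2$ of the exhaustive minimum of $r_k$. The function names and the `width` callback are hypothetical; in the actual algorithm each call to `width` would be an (expensive) approximate width computation.

```python
import math

def objective(k, width, sigma, p):
    """r_k = k*sigma^2 + width(k) * sigma * sqrt(log p), as in Algorithm 1 for q = 1."""
    return k * sigma**2 + width(k) * sigma * math.sqrt(math.log(p))

def balanced_dimension(width, n, sigma, p):
    """Binary search for the crossing point of the two terms of r_k.
    Assumes width(k) is non-increasing in k; uses O(log n) width evaluations."""
    def crossed(k):
        return k * sigma**2 >= width(k) * sigma * math.sqrt(math.log(p))
    if not crossed(n):
        return n
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if crossed(mid):
            hi = mid
        else:
            lo = mid + 1
    # Compare the crossing point with its predecessor and keep the better one.
    if lo > 0 and objective(lo - 1, width, sigma, p) < objective(lo, width, sigma, p):
        return lo - 1
    return lo

# Toy width profile (an assumption for illustration only).
n, p, sigma = 100, 100, 0.1
width = lambda k: math.sqrt(max(n - k, 0) / n)
k_fast = balanced_dimension(width, n, sigma, p)
k_slow = min(range(n + 1), key=lambda k: objective(k, width, sigma, p))
```

The factor $2$ loss relative to the exhaustive scan is absorbed by the $O$ notation in the risk bounds.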

Algorithm 1 Nearly optimal estimator for $X\ell_1$.
Input: design matrix $X$ and observation $\tilde y$.
Output: $\hat y$.

1: Let $x_1, \dots, x_p$ be the column vectors of $X$. Denote the set by $Y$;
2: for $k \in \{1, \dots, n\}$ do
3: Compute a projection $P_k$ such that $z_k = \|P_k Y\| = O(\sqrt{\log p})\, d_k(K)$;
4: Compute $r_k = k\sigma^2 + z_k \sigma \sqrt{\log p}$;
5: end for
6: Pick $k^* = \arg\min_k r_k$, and let $P = P_{k^*}$ and $P^\perp$ be the subspace orthogonal to $P$;
7: Compute $\hat y^\perp$ as the nearest neighbor of $P^\perp \tilde y$ to the convex hull of $\pm P^\perp x_1, \dots, \pm P^\perp x_p$. This can be done by using any polynomial time convex programming algorithm.
8: Set $\hat y = P\tilde y + \hat y^\perp$.
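The choice of $k$ in Steps 2 to 6 can be sketched as follows. This is a minimal illustration, not the paper's implementation: `width` is a hypothetical callable standing in for the (expensive) width computation $z_k$, and it is assumed to be non-increasing in $k$. Under that assumption the increasing term $k\sigma^2$ and the non-increasing term $z_k\sigma\sqrt{\log p}$ cross once, and the better of the two indices around the crossing has a risk proxy within a factor 2 of the minimum, so a bisection needs only $O(\log n)$ width evaluations.

```python
import math

def risk_proxy(k, width, sigma, p):
    """r_k = k*sigma^2 + z_k*sigma*sqrt(log p), with z_k = width(k)."""
    return k * sigma**2 + width(k) * sigma * math.sqrt(math.log(p))

def select_k_bruteforce(width, sigma, p, n):
    """Try every k in 1..n, as in the loop of Algorithm 1."""
    return min(range(1, n + 1), key=lambda k: risk_proxy(k, width, sigma, p))

def select_k_bisect(width, sigma, p, n):
    """Bisection variant; assumes width(k) is non-increasing in k."""
    root = math.sqrt(math.log(p))
    lo, hi = 1, n
    while lo < hi:
        mid = (lo + hi) // 2
        # Has the increasing term caught up with the decreasing one?
        if mid * sigma**2 >= width(mid) * sigma * root:
            hi = mid
        else:
            lo = mid + 1
    # The minimum is within a factor 2 of the better of these two candidates.
    candidates = {lo, max(lo - 1, 1)}
    return min(candidates, key=lambda k: risk_proxy(k, width, sigma, p))
```

For example, with the made-up profile `width = lambda k: 20.0 / k`, `sigma = 1`, `p = 100` and `n = 30`, both functions can be called with the same arguments and their risk proxies compared.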



The following proof summarizes the above discussion.

Proof. [Theorem 2] By Proposition 12, we can compute an $O(\sqrt{\log p})$ approximation $d'_k$ of $d_k(X\ell_q)$. Using this approximation, we compute
$$R' = O\Big(\min_{0 \le k \le n}\big(k\sigma^2 + c_q (d'_k)^q \sigma^{2-q} (\log p)^{1-q/2}\big)\Big).$$
Since $d_k(K) \le d'_k \le c\sqrt{\log p}\, d_k(K)$ for some constant $c > 0$, we have that
$$R_H(K, \sigma) \le R' \le c^q (\log p)^{q/2} R_H(K, \sigma).$$
By Theorem 1, $R_H$ is an $O((\log p)^{1-q/2})$ approximation of $R^*$, so $R'$ is an $O((\log p)^{q/2} (\log p)^{1-q/2}) = O(\log p)$ approximation of $R^*$.

When $q = 1$, by Proposition 12, we can compute the nearly optimal projection $P$ and use convex programming to compute the nearest neighbor of $P\tilde y$ to $PX\ell_1$. The former can be done in randomized polynomial time and the latter in polynomial time. □

Remark 5. The first step of the algorithm uses the semi-definite programming relaxation to compute a nearly optimal projection of $X\ell_q$. While it has a guaranteed approximation ratio, it can be time consuming. In practice, the projections on the principal subspaces of $X\ell_2$ might serve as a good heuristic.
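A minimal sketch of the estimator with the heuristic of Remark 5, for $q = 1$ and a fixed $k$. Several choices here are assumptions made for illustration and are not taken from the paper: the projection comes from the top $k$ left singular vectors of $X$ rather than from the semi-definite program, the nearest neighbor on the projected $\ell_1$ image is approximated by projected gradient descent rather than a general convex solver, and all function names are hypothetical.

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto {x : ||x||_1 <= radius} (sort-based)."""
    if radius <= 0:
        return np.zeros_like(v)
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    ks = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * ks > css - radius)[0][-1]
    tau = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def nearest_in_l1_image(X, target, radius=1.0, iters=2000):
    """Approximate nearest point to `target` in {X theta : ||theta||_1 <= radius}."""
    step = 1.0 / max(np.linalg.norm(X, 2) ** 2, 1e-12)  # 1 / Lipschitz constant
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ theta - target)
        theta = project_l1_ball(theta - step * grad, radius)
    return X @ theta

def projected_nn_estimate(X, y_obs, k):
    """Keep k principal directions of X as observed; nearest neighbor on the rest."""
    U, _, _ = np.linalg.svd(X, full_matrices=True)
    Uk = U[:, :k]                      # heuristic subspace, not the SDP solution
    P = Uk @ Uk.T                      # projection onto the k-dimensional subspace
    P_perp = np.eye(X.shape[0]) - P    # projection onto its orthogonal complement
    y_perp = nearest_in_l1_image(P_perp @ X, P_perp @ y_obs)
    return P @ y_obs + y_perp
```

With `k = 0` this reduces to the plain nearest neighbor estimator on $X\ell_1$, and with `k = n` it returns the observation unchanged.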

Remark 6. We do not have a polynomial time estimator for $0 < q < 1$ because we lack a polynomial time algorithm for computing the nearest neighbor to the non-convex body $K = X\ell_q$. While such a nearest neighbor problem is hard, for our purpose an approximate nearest neighbor is sufficient. In addition, we only need to succeed in an average sense, as $\tilde y = y + g$ for $y \in K$ and $g$ an i.i.d. Gaussian noise. It would be interesting to know whether there exists an efficient procedure in this particular setting. We note that this problem can be formulated under the framework of smoothed analysis [35]. In both cases, we are interested in minimizing the expected performance of an algorithm (or an estimator) in the worst case.

5 Adaptive estimator when C is not given



The projected nearest neighbor estimator in the last section is nearly minimax optimal once the sparsity radius is given. In this section, we extend the same idea to design an adaptive estimator for the case when the sparsity radius is not known. Write $C = \|\theta\|_1$. Ideally, one would like to achieve some kind of oracle inequality with the error bound proportional to $R^*(X\ell_1(C), \sigma)$, i.e. the nearly optimal risk bound assuming $C$ is available. We can only partially achieve this goal, with an extra additive term of $\sqrt{n \log n}\,\sigma^2$. Here we focus on the case $q = 1$ for simplicity of exposition.

Again let $K = X\ell_1$. Intuitively, the adaptive estimator will search for the unknown $C$ at some discrete values. In view of the upper bound in Corollary 10, we will only try those $C$'s which equalize the two error terms in (4).

Define $C_k = k\sigma / d_k(K)$ for $k = 0, 1, \dots, n/2$. $C_k$ has the following properties:

1. $C_0 \le C_1 \le C_2 \le \cdots$ is monotonically increasing, since $d_k$ is non-increasing.

2. There is a constant $c > 0$ such that, for $C \ge C_k$,
$$R^*(X\ell_1(C), \sigma) \ge c k \sigma^2. \qquad (6)$$
This follows from Theorem 11.
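The step from Theorem 11 to (6) can be spelled out as follows. This is a sketch under two assumptions: the lower bound of Theorem 11 is used in the form $R^*(K,\sigma) = \Omega(\max_k \min(k\sigma^2, k^{1-2/q} d_k(K)^2))$ with $q = 1$, and $d_k(K) > 0$, so that $C_k$ is well defined. It also uses the scaling $d_k(X\ell_1(C)) = C\, d_k(X\ell_1)$.

```latex
\begin{align*}
R^*(X\ell_1(C), \sigma)
  &= \Omega\bigl(\min\bigl(k\sigma^2,\; k^{-1} d_k(X\ell_1(C))^2\bigr)\bigr)
     && \text{(Theorem 11 with } q = 1\text{)} \\
  &= \Omega\bigl(\min\bigl(k\sigma^2,\; k^{-1} C^2 d_k(K)^2\bigr)\bigr)
     && \text{(scaling of the width)} \\
  &= \Omega(k\sigma^2)
     && \text{since } C \ge C_k = k\sigma / d_k(K)
        \text{ gives } k^{-1} C^2 d_k(K)^2 \ge k\sigma^2 .
\end{align*}
```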

Further we define $P_k$ to be the $n-k$ dimensional projection that realizes $d_k(K)$, i.e. minimizes $\max_{1 \le j \le p} \|P x_j\|$ among all the $n-k$ dimensional projections. The adaptive estimator will estimate $y_k = P_k y$ against $P_k X\ell_1(C_k)$ using the nearest neighbor estimator, starting from $k = 0$. Suppose that the outcome is $\hat y_k$. It is easy to show that among the $n$ estimations $\hat y_k$ for $k = 0, \dots, n$, there is one that satisfies the true oracle risk bound, i.e. with high probability, there exists $0 \le k \le n$ such that
$$\|\hat y_k - y\|^2 = O\big(\log p \cdot R^*(X\ell_1(\|\theta\|_1), \sigma)\big).$$
Unfortunately, we cannot determine reliably which one it is. Instead, we can only choose one which is within $O(\sqrt{n \log n}\,\sigma^2)$ error. This is done by finding the minimum $k$ such that $\|\hat y_k - \tilde y_k\|^2$ is not too large (defined precisely later). Algorithm 2 contains a formal description.

Now we will show that the estimator given in Algorithm 2 satisfies the bound stated in Theorem 3. The proof requires some properties of $\|\hat y_k - \tilde y_k\|^2$ as described in Lemma 13. Denote $y_k = P_k y$ and $K_k = P_k X\ell_1(C_k)$. Let $\delta_k$ denote the $\ell_2$ distance between $y_k$ and $K_k$, i.e. $\delta_k = \min_{z \in K_k} \|y_k - z\|$.

Lemma 13. There are constants $c_1, c_2 > 0$ such that the following holds with high probability.

1. If $y_k \in K_k$, then
$$\|\hat y_k - \tilde y_k\|^2 \le (n-k)\sigma^2 + 2\sqrt{n \log n}\,\sigma^2.$$
2. If $\delta_k^2 \ge c_1(\sqrt{n \log n}\,\sigma^2 + k\sigma^2 \log p)$, then
$$\|\hat y_k - \tilde y_k\|^2 > (n-k)\sigma^2 + 2\sqrt{n \log n}\,\sigma^2.$$
3. If $\delta_k^2 \le c_1(\sqrt{n \log n}\,\sigma^2 + k\sigma^2 \log p)$, then
$$\|\hat y_k - y_k\|^2 \le c_2(\sqrt{n \log n}\,\sigma^2 + k\sigma^2 \log p).$$






Algorithm 2 Adaptive projected nearest neighbor estimator
Input: design matrix $X$ and observation $\tilde y$.
Output: estimation $\hat y$.
1: for $k \in \{0, 1, \dots, n/2\}$ do
2: Compute the $n-k$ dimensional projection $P_k$ that approximately minimizes $\max_i \|P x_i\|$
3: Compute $\tilde y_k = P_k \tilde y$, $X_k = P_k X$, and $A_k = \max_i \|P_k x_i\|$
4: Set $C_k = k\sigma / A_k$
5: Compute $\hat y_k$ to be the nearest neighbor of $\tilde y_k$ on $X_k \ell_1(C_k)$
6: if $\|\hat y_k - \tilde y_k\|^2 \le (n-k)\sigma^2 + 2\sqrt{n \log n}\,\sigma^2$ then
7: Set $\hat y = \hat y_k + P_k^\perp \tilde y$ and return;
8: end if
9: end for
10: Set $\hat y = \tilde y$.
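The control flow of Algorithm 2 can be sketched as below. This is an illustration under stated assumptions, not the paper's implementation: the projection in Step 2 is replaced by the principal-subspace heuristic (top left singular vectors of $X$), the nearest neighbor in Step 5 is approximated by projected gradient descent, $\log$ is taken to be the natural logarithm, and all function names are hypothetical.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto the l1 ball of the given radius."""
    if radius <= 0:
        return np.zeros_like(v)
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    ks = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * ks > css - radius)[0][-1]
    tau = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def nearest_in_l1_image(X, target, radius, iters=2000):
    """Approximate nearest point to `target` in {X theta : ||theta||_1 <= radius}."""
    step = 1.0 / max(np.linalg.norm(X, 2) ** 2, 1e-12)
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta = project_l1_ball(theta - step * (X.T @ (X @ theta - target)), radius)
    return X @ theta

def adaptive_estimate(X, y_obs, sigma):
    n = X.shape[0]
    U, _, _ = np.linalg.svd(X, full_matrices=True)
    slack = 2.0 * np.sqrt(n * np.log(n)) * sigma**2
    for k in range(n // 2 + 1):
        Uk = U[:, :k]
        P_removed = Uk @ Uk.T              # the k directions left unshrunk
        Pk = np.eye(n) - P_removed         # heuristic (n-k)-dimensional projection
        yk, Xk = Pk @ y_obs, Pk @ X        # Step 3
        width = np.linalg.norm(Xk, axis=0).max()
        Ck = k * sigma / width if width > 0 else 0.0   # Step 4
        yk_hat = nearest_in_l1_image(Xk, yk, Ck)       # Step 5
        if np.sum((yk_hat - yk) ** 2) <= (n - k) * sigma**2 + slack:  # Step 6
            return yk_hat + P_removed @ y_obs          # Step 7
    return y_obs.copy()                                # Step 10
```

For $k = 0$ the radius $C_0$ is zero, so the first test simply checks whether the observation is consistent with pure noise.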



By Lemma 13.1 and 13.2, Step 6 in Algorithm 2 serves as a test for whether $y_k$ is sufficiently separated from $K_k$. When $y_k \in K_k$, the test is true with high probability, and the algorithm outputs $\hat y$ and returns. But when the separation between $y_k$ and $K_k$ is large enough ($\delta_k^2 \ge c_1(\sqrt{n \log n}\,\sigma^2 + k\sigma^2 \log p)$), Step 6 tests false with high probability. Theorem 3 follows from Lemma 13.

Proof. [Theorem 3] If the test at Step 6 outputs false for some $k$, then by Lemma 13.1, $y_k \notin K_k$. Thus $y \notin X\ell_1(C_k)$, i.e. $C > C_k$. By (6), we have that $R^*(X\ell_1(C), \sigma) \ge c k \sigma^2$.

On the other hand, if Step 6 tests true for $k$, then by Lemma 13.2, $\delta_k^2 < c_1(\sqrt{n \log n}\,\sigma^2 + k\sigma^2 \log p)$, and by Lemma 13.3, $\hat y$ returned at Step 7 satisfies
$$\|\hat y - y\|^2 = \|\hat y_k - y_k\|^2 + k\sigma^2 \le c_2(\sqrt{n \log n}\,\sigma^2 + k\sigma^2 \log p) + k\sigma^2.$$
We distinguish three outcomes of Step 6.

• Step 6 tests true for $k = 0$. In this case,
$$\|\hat y - y\|^2 \le c_2 \sqrt{n \log n}\,\sigma^2.$$

• Step 6 tests true for some $k > 0$ and therefore is false for $k - 1$. In this case
$$R^*(X\ell_1(C), \sigma) \ge c(k-1)\sigma^2,$$
and
$$\|\hat y - y\|^2 \le c_2(\sqrt{n \log n}\,\sigma^2 + k\sigma^2 \log p) + k\sigma^2 = O\big(\sqrt{n \log n}\,\sigma^2 + R^*(X\ell_1(C), \sigma) \log p\big).$$

• Step 6 is never true, so Step 10 is reached. In particular, the test is false for $k = n/2$ and hence $R^*(X\ell_1(C), \sigma) \ge c(n/2 - 1)\sigma^2$, but then $\|\hat y - y\|^2 = O(n\sigma^2) = O(R^*(X\ell_1(C), \sigma))$.

In all the above cases, the bound in Theorem 3 holds. □






Remark 7. When $R^*(X\ell_1(\|\theta\|_1), \sigma) \ge \sqrt{n}\,\sigma^2$, the bound (3) in Theorem 3 becomes a true oracle risk bound (within an $O(\log p)$ factor). In view of the proof of Theorem 1, this happens when $\|\theta\|_1 d_{\sqrt{n}}(X\ell_1) \ge \sqrt{n}\,\sigma$, i.e. when $\|\theta\|_1 \ge \sqrt{n}\,\sigma / d_{\sqrt{n}}(X\ell_1)$. In such a case, the risk ranges between $\sqrt{n \log n}\,\sigma^2$ and $n\sigma^2$. So the bound (3) is nearly optimal and non-trivial for a rather large range of $\|\theta\|_1$.

Remark 8. It might be possible to apply the Lasso or Dantzig selector estimators to the projection $P_k X$ to obtain $\hat y_k$ and then choose one $\hat y_k$ similarly to Algorithm 2. This would probably result in the same bound as in (3). We choose our current exposition because Lemma 13.2 relies on the fact that $\hat y_k$ is the nearest neighbor to $P_k \tilde y$. It is not immediately clear whether it also holds for the Lasso or the Dantzig selector.

Remark 9. One may wonder if it is possible to get rid of the $\sqrt{n \log n}\,\sigma^2$ factor and obtain a pure oracle inequality bound. If such a bound is possible, then when $C = 0$, the estimator needs to map all the observations to $0$. Since it is impossible to distinguish $0$ and a sphere with
radius there might be a good reason for such an additive separation to be expected. 

6 Proofs 

6.1 Proof of Lemma 7 (bad example for the nearest neighbor estimator) 

We will now construct a bad example for the nearest neighbor estimator. While it is well known that the nearest neighbor estimator can be non-optimal, we could not find a definitive reference for a large gap. In our example, we will demonstrate a large gap of $\sqrt{n}$. Consider the ellipsoid
$$E_n = \Big\{ y = (y_1, \dots, y_n) : \sum_{i=1}^{n-1} y_i^2 + \frac{y_n^2}{\sqrt{n}} \le 1 \Big\}.$$
Set $\sigma = 1$. The orthogonal projection estimator $M(\tilde y) = (0, \dots, 0, \tilde y_n)$ has minimax error
$$\mathbb{E}\|M(\tilde y) - y\|^2 = \mathbb{E}\Big[\sum_{i=1}^{n-1} y_i^2 + (\tilde y_n - y_n)^2\Big] \le 2. \qquad (7)$$
On the other hand, we show that the nearest neighbor estimator has error $\Omega(\sqrt{n})$. For any $\tilde y = (\tilde y_1, \dots, \tilde y_n)$, by using the Lagrangian multiplier, the nearest point $\hat y$ to $\tilde y$ on $E_n$ satisfies $\tilde y_i = (1 + \lambda)\hat y_i$ for $i = 1, \dots, n-1$ and $\tilde y_n = (1 + \lambda/\sqrt{n})\hat y_n$. Now, pick $y = (0, \dots, 0, n^{1/4}) \in E_n$. Then with high probability $\sum_{i=1}^{n-1} \tilde y_i^2 = \Omega(n)$. By
$$\sum_{i=1}^{n-1} \tilde y_i^2 = (1 + \lambda)^2 \sum_{i=1}^{n-1} \hat y_i^2 \le (1 + \lambda)^2,$$
we have $\lambda = \Omega(\sqrt{n})$. But then $\hat y_n \le c\,\tilde y_n \le c\,n^{1/4}$ for some constant $c < 1$. Thus, with high probability $\|\hat y - y\| = \Omega(n^{1/4})$. So the nearest neighbor estimator has error $\Omega(n^{1/2})$. Since the projection estimator achieves a risk of $O(1)$, we have constructed an example showing that the nearest neighbor estimator can be an $\Omega(\sqrt{n})$ factor larger than the optimal.
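The gap can also be observed numerically. The sketch below is an illustration with hypothetical function names, with $\sigma = 1$ and a sample size, trial count and seed chosen arbitrarily. It computes the nearest point on $E_n$ by bisection on the Lagrange multiplier $\lambda$ from the argument above, and compares the squared error of the nearest neighbor estimator with that of the projection estimator $M$ at $y = (0, \dots, 0, n^{1/4})$.

```python
import numpy as np

def constraint(y, n):
    """Left-hand side of the inequality defining the ellipsoid E_n."""
    return np.sum(y[:-1] ** 2) + y[-1] ** 2 / np.sqrt(n)

def nearest_on_ellipsoid(y_obs):
    """Nearest point of E_n to y_obs, by bisection on the multiplier lambda."""
    n = len(y_obs)
    if constraint(y_obs, n) <= 1.0:
        return y_obs.copy()

    def shrink(lam):
        out = y_obs / (1.0 + lam)                       # coordinates 1..n-1
        out[-1] = y_obs[-1] / (1.0 + lam / np.sqrt(n))  # last coordinate
        return out

    lo, hi = 0.0, 1.0
    while constraint(shrink(hi), n) > 1.0:  # the constraint decreases in lambda
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if constraint(shrink(mid), n) > 1.0:
            lo = mid
        else:
            hi = mid
    return shrink(hi)

def compare(n=10_000, trials=20, seed=0):
    """Mean squared errors of (nearest neighbor, projection) estimators."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n)
    y[-1] = n ** 0.25
    nn_err, proj_err = [], []
    for _ in range(trials):
        y_obs = y + rng.standard_normal(n)
        nn_err.append(np.sum((nearest_on_ellipsoid(y_obs) - y) ** 2))
        proj = np.zeros(n)
        proj[-1] = y_obs[-1]
        proj_err.append(np.sum((proj - y) ** 2))
    return float(np.mean(nn_err)), float(np.mean(proj_err))
```

Calling `compare()` for increasing `n` should show the first value growing roughly like $\sqrt{n}$ while the second stays of constant order.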
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6.2 Proof of Proposition 9 

It is well known that the error of the nearest neighbor estimator is determined by the metric structure of $K$. For two bodies $K_1, K_2 \subset \mathbb{R}^n$, define the (dyadic) entropy number $e_k(K_1, K_2)$, for any $k \ge 0$, as the minimum $\epsilon$ such that $K_1$ can be covered by $2^k$ copies of $\epsilon K_2$. When $K_2$ is the unit $\ell_2$ ball, we simply write it as $e_k(K_1)$.

For a random vector $g \in G = G_n(1)$ and any $y \in \mathbb{R}^n$, let $g_y$ denote the random variable $g \cdot y \in \mathbb{R}$. The classical Dudley bound states that there is a constant $c > 0$ such that
$$\mathbb{E}_{g \leftarrow G}\Big[\sup_{y \in K} |g_y|\Big] \le c \sum_{k=0}^{\infty} 2^{k/2} e_{2^k}(K).$$
We need a slight variation of the above bound where the summation is over $k$ above some threshold. For $\delta > 0$, write
$$k(\delta) = \lfloor \log(\min\{k : e_k(K) \le \delta\}) \rfloor,$$
$$\gamma(K, \kappa) = \sum_{k=\kappa}^{\infty} 2^{k/2} e_{2^k}(K),$$
$$K(\delta) = \{ y \in K : \|y\| \le \delta \}.$$
With the above notations,

Lemma 14. There is a constant $c > 0$ such that, for any $t > 0$,
$$\mathrm{Prob}_{g \leftarrow G}\Big[\sup_{y \in K(\delta)} |g_y| > t\,\gamma(K, k(\delta))\Big] \le \exp\big(-c t^2 2^{k(\delta)}\big).$$

Proof. By the standard chaining argument [37]. Clearly the result holds if we replace $e_k(K)$ with any upper bound of $e_k(K)$. □

Now we prove Proposition 9. Without loss of generality, we assume $\sigma = 1$. We apply the standard technique to bound the error of the nearest neighbor estimator by the supremum of Gaussian processes [40, 34]. The starting point is the well-known observation that for $\hat y = N_K(\tilde y)$,
$$\|\hat y - y\|^2 \le 2(\tilde y - y) \cdot (\hat y - y). \qquad (8)$$
Since $y, \hat y \in K = X\ell_q$ and by the quasi-convexity of $\ell_q$ for $0 < q \le 1$, we have that $\hat y - y \in d \cdot K$ for $d = 2^{1/q}$. Observe that $g = \tilde y - y$ is a Gaussian random vector. We can bound $\|\hat y - y\|$ through the Dudley bound over the $\ell_q^p$ ball as follows.

To apply Lemma 14, we need an estimate on the entropy number of $K = X\ell_q^p$. Write $A = \|K\|$. The following is a consequence of [15, 24]. For completeness, we include the derivation in Appendix A.



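Lemma 15.
$$e_{2k}(X\ell_q^p) = \begin{cases} O(A) & k \le \log p \\ O\Big(\big(f_q \frac{\log(1 + p/k)}{k}\big)^{1/q - 1/2} A\Big) & \log p \le k \le p \\ O\big(2^{-2k/p} (f_q/p)^{1/q - 1/2} A\big) & k \ge p, \end{cases} \qquad (9)$$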
where f q = 0(|ln|) is a constant dependent on q only. 
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Now the crucial lemma is 

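Lemma 16. Suppose that $A \le p^{1/q} (\log p)^{1/2}$ and $A / p^{1/q - 1/2} \le \delta \le A$. For any constant $d > 0$, there exists $c(q, d) > 0$, dependent on $q$ and $d$ only, such that
$$\mathrm{Prob}_{g \leftarrow G}\Big[\sup_{v \in K, \|v\| \le \delta} |g_v| > c(q, d)\, A^{\frac{q}{2-q}} \delta^{\frac{2-2q}{2-q}} \sqrt{\log p}\Big] \le p^{-d (A/\delta)^{q/2}}.$$

Proof. The proof is by applying Lemmas 14 and 15. By Lemma 15, for
$$A / p^{1/q - 1/2} \le \delta \le A,$$
we have
$$k(\delta) = O\big((A/\delta)^{\frac{2q}{2-q}} \log p\big) = O(p).$$
Therefore,
$$\gamma(K, k(\delta)) = \sum_{k = k(\delta)}^{\infty} 2^{k/2} e_{2^k}(K) = \sum_{k = k(\delta)}^{\log p} 2^{k/2} e_{2^k}(K) + \sum_{k = \log p}^{\infty} 2^{k/2} e_{2^k}(K).$$
By Lemma 15, it is easily seen that in both sums the dominant term is the first term, i.e. when $k = k(\delta)$ and $k = \log p$, respectively. Plugging in $e_k(K)$ for these values, we have
$$\gamma(K, k(\delta)) \le O\Big(\delta \sqrt{(A/\delta)^{\frac{2q}{2-q}} \log p}\Big) + O\big(\sqrt{p}\, A / p^{1/q - 1/2}\big) \le O\Big(A^{\frac{q}{2-q}} \delta^{\frac{2-2q}{2-q}} \sqrt{\log p} + A p^{1 - 1/q}\Big).$$
It is easy to verify that with $\delta \ge A / p^{1/q - 1/2}$,
$$A^{\frac{q}{2-q}} \delta^{\frac{2-2q}{2-q}} \sqrt{\log p} \ge c A p^{1 - 1/q} \sqrt{\log p},$$
for some constant $c > 0$. So the first term dominates, that is
$$\gamma(K, k(\delta)) = O\Big(A^{\frac{q}{2-q}} \delta^{\frac{2-2q}{2-q}} \sqrt{\log p}\Big).$$
The claim now follows from Lemma 14. □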

With the above preparation, we are ready to prove Proposition 9.

Proof. [Proposition 9] We assume $\sigma = 1$. Recall $A = \|K\|$. We can further assume
$$\sqrt{\log p} \le A \le n^{1/q} / (\log p)^{(2-q)/(2q)}. \qquad (10)$$
Otherwise the claim follows immediately by using the trivial bound of $O(\min(A^2, n\sigma^2))$. Together with the assumption that $p = \Omega(n / \log n)$, the upper bound in (10) implies that
$$A = O\big(p^{1/q} (\log p)^{1/2}\big). \qquad (11)$$
Write $\delta_0 = c A^{q/2} (\log p)^{1/2 - q/4}$ for some sufficiently large $c$ such that $\delta_0 \ge A / p^{1/q - 1/2}$. This is possible as $A = O(p^{1/q} (\log p)^{1/2})$. Hence, by applying Lemma 16, we have that for $\delta_0 \le \delta \le A$ and any $d > 0$ there exists $c(q, d) > 0$ such that
$$\mathrm{Prob}_{g \leftarrow G}\Big[\sup_{y \in K(\delta)} |g_y| > c(q, d)\, A^{\frac{q}{2-q}} \delta^{\frac{2-2q}{2-q}} \sqrt{\log p}\Big] \le p^{-d (A/\delta)^{q/2}}.$$
Now denote by $\mathcal{E}$ the following event:
$$\exists\, y \in K \;\; (\delta_0 \le \|y\| \le A) \wedge \Big(|g_y| > t \sqrt{\log p}\, A^{\frac{q}{2-q}} \|y\|^{\frac{2-2q}{2-q}}\Big).$$
By the peeling argument we show that we can choose $t$, dependent on $q$ only, such that $\mathrm{Prob}[\mathcal{E}] \le p^{-4/q}$. Define
$$\bar K(\delta) = K(\delta) \setminus K(\delta/2).$$
Clearly $\bar K(\delta) \subset K(\delta)$ and for any $y \in \bar K(\delta)$, $\|y\| \ge \delta/2$. By these we have
$$\mathrm{Prob}\Big[\sup_{y \in \bar K(\delta)} |g_y| > t_q \sqrt{\log p}\, A^{\frac{q}{2-q}} \|y\|^{\frac{2-2q}{2-q}}\Big] \le p^{-d (A/\delta)^{q/2}}.$$
Hence for any $d > 0$, there is $c(q, d) > 0$ such that
$$\mathrm{Prob}[\mathcal{E}] = \mathrm{Prob}\Big[\sup_{y \in K, \|y\| \ge \delta_0} |g_y| > c(q, d) \sqrt{\log p}\, A^{\frac{q}{2-q}} \|y\|^{\frac{2-2q}{2-q}}\Big]$$
$$\le \sum_{k=0}^{\log(A/\delta_0)} \mathrm{Prob}\Big[\sup_{y \in \bar K(2^k \delta_0)} |g_y| > c(q, d) \sqrt{\log p}\, A^{\frac{q}{2-q}} \|y\|^{\frac{2-2q}{2-q}}\Big]$$
$$\le \sum_{k=0}^{\log(A/\delta_0)} p^{-d (A/(2^k \delta_0))^{q/2}}.$$

Now choosing $d = 4/q$ and setting $t_q = c(q, 4/q)$, we have that $\mathrm{Prob}[\mathcal{E}] = O(p^{-4/q})$. Let $z = \hat y - y$. So for $\|z\| \ge \delta_0$, with probability $1 - O(p^{-4/q})$,
$$\|z\|^2 \le 2|g \cdot z| \le t_q \sqrt{\log p}\, A^{\frac{q}{2-q}} \|z\|^{\frac{2-2q}{2-q}}.$$
That is
$$\|z\| = O\big(A^{q/2} (\log p)^{1/2 - q/4}\big) = O(\delta_0).$$
Hence with probability $1 - O(p^{-4/q})$,
$$\|\hat y - y\|^2 = O(\delta_0^2) = O\big(A^q (\log p)^{1 - q/2}\big).$$
Since $\|\hat y - y\| \le 2A \le 2p^{1/q}$, we have that
$$\mathbb{E}\big[\|\hat y - y\|^2\big] \le \delta_0^2 + O\big(p^{-4/q} \cdot 2p^{2/q}\big) = O\big(A^q (\log p)^{1 - q/2}\big).$$
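The step from the bound on $\|z\|^2$ to the bound on $\|z\|$ is a matter of collecting exponents. A worked version, assuming the threshold has the form $t_q \sqrt{\log p}\, A^{q/(2-q)} \|z\|^{(2-2q)/(2-q)}$ as in Lemma 16:

```latex
\begin{align*}
\|z\|^{2} \le t_q \sqrt{\log p}\; A^{\frac{q}{2-q}} \|z\|^{\frac{2-2q}{2-q}}
  &\iff \|z\|^{\,2 - \frac{2-2q}{2-q}} \le t_q \sqrt{\log p}\; A^{\frac{q}{2-q}} \\
  &\iff \|z\|^{\frac{2}{2-q}} \le t_q (\log p)^{1/2} A^{\frac{q}{2-q}} \\
  &\iff \|z\| \le t_q^{\frac{2-q}{2}} (\log p)^{\frac{2-q}{4}} A^{\frac{q}{2}}
        = t_q^{1 - q/2}\, A^{q/2} (\log p)^{1/2 - q/4}.
\end{align*}
```

Squaring the last line gives $\|z\|^2 = O(A^q (\log p)^{1-q/2})$, with the constant depending on $q$ only.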

For general $\sigma > 0$, we apply the standard scaling formula $R_N(K, \sigma) = \sigma^2 R_N(K/\sigma, 1)$ and complete the proof of Proposition 9. The constant $c_q = O(2^{1/q} f_q)$ comes from multiplying $d = 2^{1/q}$ and $f_q$ in Lemma 15. □

6.3 Proof of Theorem 11



To establish the lower bound, we consider the largest Euclidean ball of various dimensions contained in $K$. Intuitively, we show that if the Kolmogorov width of $K$ is large then it has to contain a large enough Euclidean ball, in terms of both radius and dimension, which allows us to nearly match the upper bound. The crucial technical tool is the restricted invertibility result by Bourgain and Tzafriri [8], further developed by Szarek and Talagrand [36] and Giannopoulos [23].

Definition 17. For a set of vectors $S$, let $\mathrm{span}[S]$ denote the linear subspace spanned by $S$. A set $V = \{v_1, \dots, v_s\}$ is called $\delta$-wide if for any $1 \le i \le s$, $\mathrm{dist}(v_i, \mathrm{span}[V \setminus \{v_i\}]) \ge \delta$, where $\mathrm{dist}(v, P)$ denotes the minimum distance between $v$ and any vector in $P$.
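Definition 17 can be checked numerically for a concrete set of vectors. The helper below is a hypothetical illustration (the name `wideness` is not from the paper): it returns the largest $\delta$ for which the columns of a matrix form a $\delta$-wide set, computing each distance to the span of the remaining columns by least squares.

```python
import numpy as np

def wideness(vectors):
    """Largest delta such that the columns of `vectors` are delta-wide."""
    V = np.asarray(vectors, dtype=float)
    dists = []
    for i in range(V.shape[1]):
        others = np.delete(V, i, axis=1)
        if others.shape[1] == 0:
            # The span of the empty set is {0}.
            dists.append(np.linalg.norm(V[:, i]))
            continue
        # Residual of the least-squares fit = distance to span(others).
        coef, *_ = np.linalg.lstsq(others, V[:, i], rcond=None)
        dists.append(np.linalg.norm(V[:, i] - others @ coef))
    return min(dists)
```

An orthonormal set is 1-wide, and a linearly dependent set has wideness 0, which matches the use of $\delta$-wideness in the proof of Lemma 20 to guarantee linear independence.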

The following proposition can be gleaned from work in [8, 36, 23]. See [26] (Proposition 5.2) for a proof.

Proposition 18. For any $\delta$-wide set $V = \{v_1, \dots, v_s\}$, there exists $S \subseteq \{1, \dots, s\}$ with $|S| \ge (1 - \epsilon)s$ such that for any $a = (a_j)_{j \in S}$, $\|\sum_{j \in S} a_j v_j\| \ge c \sqrt{\epsilon/s}\; \delta \sum_{j \in S} |a_j|$, where $c$ is an absolute constant.

We make the following observation. 

Lemma 19. Suppose that $K = X\ell_q$ and $X = (x_1, \dots, x_p)$. Then for any $k \ge 0$, there exist $k + 1$ vectors $V \subseteq \{x_1, \dots, x_p\}$ such that $V$ is $d_k(K)$-wide.

Proof. For a set of points $p_1, \dots, p_s$ and $k \ge s - 1$, let $\mathrm{vol}_k(p_1, \dots, p_s)$ denote the $k$-volume of the convex hull of $p_1, \dots, p_s$.

We find $k + 1$ points $V = \{v_1, \dots, v_{k+1}\}$ in $K$ such that the $(k+1)$-volume of the simplex spanned by the origin $O$ and $v_1, \dots, v_{k+1}$ is the maximum, i.e.
$$V = \arg\max_{y_1, \dots, y_{k+1} \in K} \mathrm{vol}_{k+1}(O, y_1, \dots, y_{k+1}).$$
Since $K$ is a compact set, $V \subset K$. We first show that $V$ is $d_k(K)$-wide. Consider the $k$-dimensional subspace $P$ spanned by $v_1, \dots, v_k$. By the definition of $d_k$, we have $\sup_{y \in K} \|Py - y\| \ge d_k(K)$. Or equivalently
$$\sup_{y \in K} \mathrm{dist}(y, \mathrm{span}[\{v_1, \dots, v_k\}]) \ge d_k(K). \qquad (12)$$
On the other hand,
$$\mathrm{vol}_{k+1}(O, v_1, \dots, v_{k+1}) = \frac{1}{k+1} \mathrm{vol}_k(O, v_1, \dots, v_k) \cdot \mathrm{dist}(v_{k+1}, \mathrm{span}[\{v_1, \dots, v_k\}]). \qquad (13)$$
By the maximality of $\mathrm{vol}_{k+1}(O, v_1, \dots, v_{k+1})$ and (12) and (13), we have
$$\mathrm{dist}(v_{k+1}, \mathrm{span}[\{v_1, \dots, v_k\}]) \ge d_k(K).$$
Repeating this argument for each $v_i$ in $V$, we have that $V$ is $d_k(K)$-wide. In addition, for $K = X\ell_1$, $K$ is the convex hull of $\pm x_1, \dots, \pm x_p$. Hence for any projection $P$, $\arg\max_{x \in K} \|Px\|$ has to be a vertex of $K$. That is, $V \subseteq \{\pm x_i : 1 \le i \le p\}$. It is easy to see that $V$ can be chosen such that $V \subseteq \{x_1, \dots, x_p\}$. Since $X\ell_q \subseteq X\ell_1$ for $0 < q \le 1$, $d_k(X\ell_q) \le d_k(X\ell_1)$. This holds for any $0 < q \le 1$. □

Using Proposition 18 and Lemma 19, we have the following.

Lemma 20. There exists a constant $c > 0$ such that for any $K = X\ell_q^p$, $k \ge 0$, and $0 < \epsilon < 1$, there exists a linear subspace $P$ such that $P \cap K$ contains a $(1 - \epsilon)k$ dimensional $\ell_2$ ball with radius $\Omega\big(\sqrt{\epsilon(1 - \epsilon)}\; k^{1/2 - 1/q} d_k(K)\big)$.

Proof. Clearly we can assume that $d_k(K) > 0$. Let $V$ be the $d_k(K)$-wide set as in Lemma 19. Write $S_0 = \{i : x_i \in V\}$. By Proposition 18, let $S \subseteq S_0$ be such that $|S| \ge (1 - \epsilon)|S_0|$ and for any $\{a_i\}_{i \in S}$,
$$\Big\|\sum_{i \in S} a_i x_i\Big\| \ge c \sqrt{\epsilon / |S_0|}\; d_k(K) \sum_{i \in S} |a_i|.$$
According to the reverse Hölder inequality, for $x \in \mathbb{R}^m$ and $0 < q \le 1$, $\|x\|_1 \ge m^{1 - 1/q} \|x\|_q$. Hence, for any $\{a_i\}$ such that $\sum_{i \in S} |a_i|^q = 1$, $\sum_{i \in S} |a_i| \ge |S|^{1 - 1/q}$. Thus if $\|a\|_q = 1$, then
$$\Big\|\sum_{i \in S} a_i x_i\Big\| \ge c \sqrt{\epsilon / |S_0|}\; d_k(K) |S|^{1 - 1/q} \ge c \sqrt{\epsilon(1 - \epsilon)}\; |S|^{1/2 - 1/q} d_k(K). \qquad (14)$$
Let $P$ be the subspace spanned by $x_i$ for $i \in S$. Since $\{x_i\}_{i \in S}$ is $d_k(K)$-wide, they are linearly independent. That is, $K \cap P$ is fully ($|S|$) dimensional. On the other hand, by (14), for any $v$ on the boundary of $K \cap P$, we have that
$$\|v\| \ge c \sqrt{\epsilon(1 - \epsilon)}\; |S|^{1/2 - 1/q} d_k(K).$$
Hence, $K \cap P$ contains an $|S|$-dimensional $\ell_2$ ball with radius
$$c \sqrt{\epsilon(1 - \epsilon)}\; |S|^{1/2 - 1/q} d_k(K).$$
The claim follows by $|S| \le k$ and $1/2 - 1/q < 0$. □

By Lemma 4, a $k$-dimensional $\ell_2$ ball $B$ of radius $r$ has $R^*(B, \sigma) = \Omega(\min(k\sigma^2, r^2))$. In addition, by the definition of minimax risk, for any $K_1 \supseteq K_2$, $R^*(K_1, \sigma) \ge R^*(K_2, \sigma)$ (see for example [20]). Choosing $\epsilon = 1/2$, we have that for $K = X\ell_q$,
$$R^*(K, \sigma) = \Omega\Big(\max_k \min\big(k\sigma^2,\; k^{1 - 2/q} d_k(K)^2\big)\Big).$$

k 

6.4 Proof of Theorem 1 

Proof. Let
$$k^* = \arg\max_k \min\big(d_k(K),\; k^{1/q}\sigma\big).$$
When there is a tie, we pick $k^*$ to be the smallest among the ties. Clearly $0 < k^* < n$ since $d_n(K) = 0$. When $k^* = 1$, it is easy to show the claim holds. For $1 < k^* < n$, we distinguish two cases.

Case 1. $d_{k^*}(K) \ge (k^*)^{1/q}\sigma$.

In this case we have that $d_{k^*+1}(K) < (k^* + 1)^{1/q}\sigma$. Otherwise, we would have that
$$\min\big(d_{k^*+1}(K),\; (k^* + 1)^{1/q}\sigma\big) = (k^* + 1)^{1/q}\sigma > (k^*)^{1/q}\sigma = \min\big(d_{k^*}(K),\; (k^*)^{1/q}\sigma\big).$$
This contradicts the maximality of $k^*$. Since $d_{k^*}(K) \ge (k^*)^{1/q}\sigma$, $(k^*)^{1 - 2/q} d_{k^*}(K)^2 \ge k^*\sigma^2$. We apply the lower bound in (5) and obtain that
$$R^*(K, \sigma) = \Omega(k^*\sigma^2).$$
For the upper bound, by taking $k = k^* + 1$ in (4), we have
$$R_H(K, \sigma) = O\big((k^* + 1)\sigma^2 + c_q d_{k^*+1}(K)^q \sigma^{2-q} (\log p)^{1 - q/2}\big)$$
$$= O\big((k^* + 1)\sigma^2 + c_q ((k^* + 1)^{1/q}\sigma)^q \sigma^{2-q} (\log p)^{1 - q/2}\big)$$
$$= O\big((k^* + 1)\sigma^2 (\log p)^{1 - q/2}\big)$$
$$= O\big(R^*(K, \sigma) (\log p)^{1 - q/2}\big).$$

Case 2. $d_{k^*}(K) < (k^*)^{1/q}\sigma$.

In this case $d_{k^*}(K) \ge (k^* - 1)^{1/q}\sigma$. Otherwise, we would have that $d_{k^*}(K) < (k^* - 1)^{1/q}\sigma$ and $d_{k^*}(K) < d_{k^*-1}(K)$. The latter holds because we pick $k^*$ to be the smallest $k$ in case there is a tie. This would imply that
$$\min\big(d_{k^*-1}(K),\; (k^* - 1)^{1/q}\sigma\big) > d_{k^*}(K) \ge \min\big(d_{k^*}(K),\; (k^*)^{1/q}\sigma\big).$$
Again this contradicts the maximality of $k^*$. Hence for the lower bound, we have that
$$R^*(K, \sigma) = \Omega\big((k^*)^{1 - 2/q} d_{k^*}(K)^2\big) = \Omega\big((k^*)^{1 - 2/q} (k^* - 1)^{2/q} \sigma^2\big) = \Omega(k^*\sigma^2), \text{ by } k^* > 1.$$
Setting $k = k^*$ in (4), we have
$$R_H(K, \sigma) = O\big(k^*\sigma^2 + c_q d_{k^*}(K)^q \sigma^{2-q} (\log p)^{1 - q/2}\big)$$
$$= O\big(k^*\sigma^2 + c_q ((k^*)^{1/q}\sigma)^q \sigma^{2-q} (\log p)^{1 - q/2}\big)$$
$$= O\big(k^*\sigma^2 (\log p)^{1 - q/2}\big)$$
$$= O\big(R^*(K, \sigma) (\log p)^{1 - q/2}\big).$$
Therefore, for any $0 < q \le 1$ and $p = \Omega(n / \log n)$, for $K = X\ell_q$ where $X$ is an $n \times p$ matrix, we have that $R_H(K, \sigma) = O\big((\log p)^{1 - q/2} R^*(K, \sigma)\big)$. □

6.5 Proof of Lemma 13 

Proof. In what follows, all the statements hold with high probability, say $1 - 1/n^2$.

1. Since $\tilde y_k - y_k$ is an $n-k$ dimensional Gaussian vector, by the property of the $\chi^2$-distribution,
$$\|\tilde y_k - y_k\|^2 \le (n-k)\sigma^2 + 2\sqrt{n \log n}\,\sigma^2.$$
Since $\|\hat y_k - \tilde y_k\| \le \|y_k - \tilde y_k\|$, the statement follows immediately.

2. Let $z$ denote the nearest neighbor of $y_k$ on $K_k$. So $\|z - y_k\| = \delta_k$. Further,
$$(\hat y_k - z) \cdot (y_k - z) \le 0. \qquad (15)$$
Following the same analysis as for the nearest neighbor estimator, we have
$$\|\hat y_k - z\|^2 \le 2(\hat y_k - z) \cdot (\tilde y_k - z)$$
$$= 2(\hat y_k - z) \cdot (\tilde y_k - y_k) + 2(\hat y_k - z) \cdot (y_k - z)$$
$$\le 2(\hat y_k - z) \cdot (\tilde y_k - y_k) \quad \text{by (15)}$$
$$\le c_1 C_k d_k \sigma \sqrt{\log p} = c_1 k\sigma^2 \sqrt{\log p}.$$
Hence
$$\|\hat y_k - \tilde y_k\|^2 = \|\hat y_k - z\|^2 + \|z - \tilde y_k\|^2 + 2(\hat y_k - z) \cdot (z - \tilde y_k)$$
$$\ge \|z - \tilde y_k\|^2 + 2(\hat y_k - z) \cdot (z - y_k) + 2(\hat y_k - z) \cdot (y_k - \tilde y_k)$$
$$\ge \|\tilde y_k - z\|^2 + 2(\hat y_k - z) \cdot (y_k - \tilde y_k) \quad \text{by (15)}$$
$$\ge \|\tilde y_k - z\|^2 - 2|(\hat y_k - z) \cdot (\tilde y_k - y_k)|.$$
We bound these two terms separately.
$$\|\tilde y_k - z\|^2 = \|\tilde y_k - y_k\|^2 + \|y_k - z\|^2 + 2(\tilde y_k - y_k) \cdot (y_k - z) \ge (n-k)\sigma^2 - 2\sqrt{n \log n}\,\sigma^2 + \delta_k^2 - 4\delta_k \sigma \sqrt{\log p}.$$
By the analysis for the nearest neighbor estimator, we have
$$2|(\hat y_k - z) \cdot (\tilde y_k - y_k)| \le c_1 C_k d_k \sigma \sqrt{\log p} = c_1 k\sigma^2 \sqrt{\log p}.$$
Putting them together, we can take $\delta_k^2 = c_1(\sqrt{n \log n}\,\sigma^2 + k\sigma^2 \sqrt{\log p})$ for some sufficiently large $c_1$ and obtain
$$\|\hat y_k - \tilde y_k\|^2 > (n-k)\sigma^2 + 2\sqrt{n \log n}\,\sigma^2.$$

3. If $\delta_k^2 \le c_1(\sqrt{n \log n}\,\sigma^2 + k\sigma^2 \log p)$, then according to the above
$$\|\hat y_k - z\|^2 \le O(k\sigma^2 \log p) = O(\delta_k^2).$$
Hence
$$\|\hat y_k - y_k\| \le \|\hat y_k - z\| + \|z - y_k\| = O(\delta_k).$$
□
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A The entropy number of X£ q 

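By Guédon and Litvak [24] (Theorem 6),
$$e_k(\ell_q^p, \ell_1^p) = \begin{cases} \Theta(1) & k \le \log p \\ \Theta\Big(\big(f_q \frac{\log(1 + p/k)}{k}\big)^{1/q - 1}\Big) & \log p \le k \le p \\ \Theta\big(2^{-k/p} (f_q/p)^{1/q - 1}\big) & k \ge p, \end{cases} \qquad (16)$$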
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where f q = 0(| In |) is a constant dependent on q only, 
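and by Carl and Pajor [15],
$$e_k(X\ell_1^p) = \begin{cases} O(A) & k \le \log p \\ O\Big(\big(\frac{\log(1 + p/k)}{k}\big)^{1/2} A\Big) & \log p \le k \le p \\ O\big(2^{-k/p} (1/p)^{1/2} A\big) & k \ge p. \end{cases} \qquad (17)$$
From the definition of $e_k$, we have (see also [33])
$$e_{k_1 + k_2}(K_1, K_3) \le e_{k_1}(K_1, K_2)\, e_{k_2}(K_2, K_3). \qquad (18)$$
By (18), $e_{2k}(X\ell_q^p) \le e_k(\ell_q^p, \ell_1^p)\, e_k(X\ell_1^p)$. So we have
$$e_{2k}(X\ell_q^p) = \begin{cases} O(A) & k \le \log p \\ O\Big(\big(f_q \frac{\log(1 + p/k)}{k}\big)^{1/q - 1/2} A\Big) & \log p \le k \le p \\ O\big(2^{-2k/p} (f_q/p)^{1/q - 1/2} A\big) & k \ge p. \end{cases}$$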
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